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Abstract 



We present an automated procedure to assign CATH and SCOP classifications to proteins 
whose FSSP score is available. CATH classification is assigned down to the topology level and 
SCOP classification to the fold level. As the FSSP database is updated weekly, this method 
makes it possible to update also CATH and SCOP with the same frequency. Our predictions 
have a nearly perfect success rate when ambiguous cases are discarded. These ambiguous 
cases are intrinsic in any protein structure classification, which relies on structural information 
alone. Hence, we introduce the notion of "twilight zone for structure classification". We 
further suggest that in order to resolve these ambiguous cases other criteria of classification, 
based also on information about sequence and function, must be used. 
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INTRODUCTION 



The first step to analyze the vast amount of information provided by genome sequencing 
projects is to organize proteins (the gene products) into classes with similar properties. Since 
during evolution protein structures are much more conserved than sequences and functions Q , 
proteins are usually classified first by their structural similarity (phenetic classification) and 
then by the similarity of their sequences or by the similarity of their functions (phylogenetic 
classification) 0. 

A reliable structural classification scheme is useful for several reasons. Perhaps the most 
exciting perspective is the possibility to routinely assign a function to newly identified genes 
[P. This goal may be achievable as a classified database provides a library of representative 
structures to perform prediction of protein structure by homology or by threading 
and it allows for the identification of distant evolutionary relationships P|. In addition, given 
a particular protein, it provides a tool to identify other proteins of similar structure and 
function |jl^. The knowledge of the structure helps to reveal the mechanism of molecular 
recognition, involved in catalysis, signalling and binding [0] and may lead to the rational 



design of new drugs |11|. At a more abstract level, the physical principles dictating structural 
stability of proteins are revealed by their folded state. Therefore most of the recently proposed 
methods to derive energy functions to perform protein fold predictions rely in different ways 
on structural data p^ , p^ . 

The most comprehensive repository of three dimensional structures of proteins is the 
Protein Data Bank (PDB) [l^. The number of released structures is increasing at the pace 
of about 50 per week and more than 12,000 complete sets of coordinates were available at the 
time of writing. Many research groups maintain web-accessible hierarchical classifications of 
PDB entries. The most widely used are: FSSP pl, CATH pi, SCOP 0, HOMSTRAD 



, MMDB 1 19] and 3Dee |20| (See Table |^ for a list of abbreviations). Here we consider 



three of these; the FSSP, the CATH and the SCOP databases. Each group has its own way 
to compare and classify proteins; these three classification schemes are, however, consistent 



with each other to a large extent |21,2 



The FSSP database 

The FSSP (Fold classification based on Structure-Structure alignment of Proteins) uses a fully 



automated structure comparison algorithm, DALI (Distances ALIgnment algorithm) 123, 24|, 
to calculate a pairwise structural similarity measure (the S-score) between protein chains. 

The algorithm searches for that amino acid alignment between the two protein chains, 
which yields the most similar pair of C^ distance maps. In general, the more geometrically 
similar are two chain structures, the higher is their S-score. The mean and standard deviations 
of the S-scores obtained for all the pairs of proteins are evaluated. Shifting the S-scores by 
their mean and rescaling by the standard deviation yield the statistically meaningful Z-scores. 

For classification of structures the FSSP uses the Z-scores for all pairs in a representative 
sub-set of the PDB. A fold tree is generated by applying an average-linkage hierarchical 
clustering algorithm [^5| to this all-against-all Z-score matrix. An alternate classification 



based on a more common four- level hierarchy is also available |24|. 
CATH database 

Thornton and coworkers use a combination of automatic and manual procedures to create 
a hierarchical classification of domains (CATH) |jl^. They arrange domains in a four level 
hierarchy of families according to the protein class (C), architecture {A ), topology (T ) and 
homologous superfamily {H). The class level describes the secondary structures found in the 
domain |2^ and is created automatically. There are four class types: mainly-a, mainly-/?, 
a-/3and proteins with few secondary structures (FSS). The architecture level, on the other 



hand, is assigned manually (using human judgement) and describes the shape created by the 
relative orientation of the secondary structure units. The shape families are chosen according 
to a commonly used structure classification, like barrel, sandwich, roll, etc. The topology level 
groups together all structures with similar sequential connectivity between their secondary 
structure elements. Structures with high structural and functional similarity are put in the 
same fourth level family, called homologous superfamily. Both the topology and homologous 
superfamily levels are assigned by thresholding a calculated structural similarity measure 
(SSAP) at two different levels, respectively |2^,^. The CATH database has been recently 
linked to the DHS (Dictionary of Homologous Superfamilies) database which allows to 
further analyze structural and functional features of evolutionary related proteins. There is a 
growing need for annotating proteins classified in structural databases as structural genomic 
initiatives are providing a large number of new proteins whose function might be gathered 
by distant homology informations. 



The SCOP database 

The SCOP (Structural Classification of Proteins) |]17| database is organized hierarchically. 
The lower two levels (family and superfamily) describe near and distant evolutionary rela- 
tionship, the third (fold) describes structural similarity and the top level (class) the secondary 



structure content |26|. SCOP is linked to the ASTRAL compendium [Q, which provides 
a series of tools for further analysis of the classified structures, mainly through the use of 
their sequence. At variance with FSSP and CATH, SCOP is constructed manually, by visual 
inspection and comparison of not only structures but also sequences and functions. 



Automated assignment of SCOP and CATH classifications 

In this work we present a method. Classification by Optimization (CO), to predict without 
human intervention the SCOP fold level and the CATH topology level from the FSSP pairwise 
structure similarity score. A protein for which the Z score is available is classified into a SCOP 
fold and into a CATH topology by the CO method, an optimization procedure which finds 
the assignment of minimal cost, where the cost is defined in terms of Z scores (see Methods). 
The query for the classification of any such protein can be submitted to the web site 



RESULTS 

Consistency of the FSSP, CATH and SCOP classifications 

We found that the FSSP and CATH databases are consistent [^]. In this section we show 
that SCOP is also consistent with these to a large extent (see also [^). In the rest of 
this work we will use this fact to derive an automated procedure to assign the CATH and 
SCOP classifications starting from the FSSP Z scores (which are updated weekly) in a fully 
automated fashion to include new releases in the PDB [j^. Here we further discuss the 
consistency of the three classification schemes by introducing concepts and quantities that 
will be later used in the prediction of the CATH and SCOP classifications. 

We first illustrate the correlation between the FSSP similarity score and the CATH clas- 
sification. A simple and visually appealing way to study this problem is shown in Figure 
||. The element Zij of the Z-score matrix (Figure |l|a) represents the score for superimposing 
structure i with structure j of the set PFrCs (a subset of the proteins in FSSP and CATH, 
see Table ^ and Methods) using the DALI algorithm 1 23 , 24 ] . In Figure ||a only the pairs with 



Z > 2 are shown, therefore the matrix is sparse and the proteins are ordered in a random 
fashion. Figure |l]b is produced by reordering the rows and columns of the original Z-score 
matrix (Figure |l|a) . The reordering is performed according to the CATH classification in the 
following way: For each of the proteins in this set we have the CATH classifications at all 



levels. First, we order the proteins by their class; within the class, by the architecture; within 
it by the topology and so on. This reordering generates a permutation of the columns and 
rows of the Z matrix. The solid black grid in Figure ^3 separates the proteins according to 
their CATH class and a thin grid is placed at the boundaries between architectures. 

Figure shows the underlying order behind the apparent randomness of Figure |l|a and 
reveals the extent to which the FSSP Z-scores reflect the CATH classification. 

Several interesting observations can be made. First, consider the Class level of CATH. 
As can be seen in Figure |l]b, there are no matrix elements with Z > 2.0 in region A, that 
connects proteins of the mainly-a class to the mainly-/? class. At variance with this, some 
proteins from both of these classes have large Z scores with proteins from the a-/3 class (region 
B). This is reasonable, because of the way similarity is defined by FSSP; a mainly-ct protein 
can have a high Z-score with an protein, due to high similarity with the a part. Second, 
turning to the Architecture level, we observe that there are architecture families which are 
highly connected within themselves, e.g. a-/3 barrels (482-525: region C), whereas for others 
the intra-family connections are more sparse. The similarities within the mainly-/3 sandwich 
family (318-406: region D) have two relatively distinct subgroups which suggest an inner 
structure corresponding to the lower levels in the CATH hierarchy. Checking the topology 
level (the third CATH level) for this architecture, one indeed finds two large topology sub- 
families, the immunoglobulin-like proteins (324-366: upper left part of region D) and the 
Jelly- Rolls (373-402: lower right part of region D), which correspond precisely to the two 
strongly connected sub-groups that appear in Figure ||b. 

We found that the CATH classification at the level of topology is reflected in the Z- 
matrix. This is to be expected since the Z-score measures the structural similarity of two 
aligned proteins, while preserving their connectivity. Overall, this analysis shows that the 
Z-matrix is correlated with the CATH classification. In a similar way it is possible to show 
that the Z score is correlated with the SCOP classification. The results are available at the 
web site 

These findings suggest that Z scores can be used to predict the CATH and SCOP clas- 
sifications of yet unclassified proteins. In what follows, we will demonstrate that this indeed 
can be done. We also estimate the success rate of our predictions and provide a web site |31| 
that can be used to retrieve our predictions for the CATH topology and the SCOP fold for 
new entries in FSSP. 

We also verified that the CATH and SCOP classifications are to a large extent mutually 
compatible. An immediate consequence of this is that it is possible to construct a "translation 
table", T, from the proteins that have already both a CATH and a SCOP classification. In 
this way, given a CATH entry one can obtain the corresponding SCOP classification (see Fig. 
^). Row i of the table refers to a particular CATH topology and column j to a particular 
SCOP fold. The element Tjj of the table is the measured fraction of times that a protein has 
a CATH topology i and a SCOP fold j. This number is calculated by enumerating all the 
10197 single-domain proteins with known CATH and SCOP classifications (PCsSs) and it is 
an estimate of Tjj, the joint probability distribution for a protein to have CATH topology i 
and SCOP fold j. Had the CATH and SCOP classifications been independent, every element 
Tij could have been expressed as a product of Cj, the fraction of proteins that belong to 
CATH topology and /Sj, the fraction that belongs to SCOP fold j, i.e. Tjj = Q * Sj. 
Randomly placing 10197 proteins using such a probability distribution yields 4780 ± 40 non- 
zero elements in the matrix. In the other extreme case, had there been a full correspondence 
between the SCOP and CATH classifications, the table would have had a single non-zero 
element in each row and column (in each CATH topology row the non-zero element would 
have been in that SCOP fold column which corresponds to it). In this case, the proteins 
in PCsSs would have been distributed among 284 non-zero elements (the number of distinct 
CATH topologies in PCsSs). 

We found 369 non-zero elements in T, meaning that the CATH and SCOP classifications 



are highly dependent. Still, the correspondence is not entirely one-to-one; in general, more 
than one SCOP fold corresponds to a given CATH topology. The number of such folds 
is, however, typically small. Such a translation table may be used to predict the SCOP 
classification of a structure already classified in CATH, or at least to significantly restrict the 
number of possibilities, (and vice versa). For example, the assignment of the CATH topology 
to a protein with known SCOP fold can be done by selecting the CATH topology with the 
largest value in the translation table for that particular SCOP fold. Such an assignment is 
correct in 93% of the cases. The corresponding assignment of the SCOP fold from the CATH 
topology is correct in 82% of the cases. Although this is possibly useful information, in this 
work we do not assign classifications in this way. 

Summary of the CO Classification Performance 

Every time the FSSP Z scores are updated (once a week) the CO classification can be applied 
to all the proteins that appear in the new FSSP release but are not yet classified in CATH 
or in SCOP. The possible outcomes of the classification procedure are: 

1. Correct classification: the predicted classification will agree with the future release of 
the databases. 

2. Rejection: the program is unable to classify the structure. 

3. Ambiguous classification: a classification is returned (both for CATH and SCOP) but 
a later release provides a different classification. 

The frequencies of these outcomes greatly depends on the statistics of the set of proteins 
to be classified. More specifically, rejected proteins are of two types: proteins that do not 
have high Z score with any other proteins ("islands" - see Methods) and clusters of proteins 
that are similar among themselves but do not have high Z score with other proteins outside 
their cluster ( "super- islands" ) . The fraction of islands and super- islands is a feature of the 
particular set of proteins to be classified. The occurrence of a super-island suggests that a 
new classification type (a new topology for CATH and and new fold for SCOP) might be 
needed. The work of maintaining CATH and SCOP can be thus focused on the classification 
of a representative from each of these super-islands. 

For the set PFCs the fraction of islands and superislands is 5%. We used this set to 
provide an upper bound for the performance of the CO method (see below); however, for the 
set PFC the fraction of rejections goes up to 22%. If rejections are not counted, we classify 
correctly 98% of the PFCs proteins. On the other hand, we could test our predictions also 
against the new CATH release v2.0. Out of 1582 proteins that were assigned to previously 
existing CATH topologies, CO has classified correctly 80%. The difference in success rates 
between PFCs and PFC is due to the different way in which the test set is nested in the 
larger set of structures with known classification. In the first case the test set consisted of 
20% of the members of PFCs, selected at random; the remaining 80% were used to "predict" 
the classification of the test set. In the second case, the members of CATH vl.7 were used to 
predict the classification of the new proteins that were added when CATH v2.0 was released. 
These new structures are not distributed uniformly at random among the members of CATH 
vl.7. 

Ambiguous classifications are due to two different mechanisms. The first stems from a well 
known problem with the way the FSSP similarity index is calculated (the "Russian doll effect" 
- see below). The second kind of "mistake" is actually not a wrong classification; rather, 
it happens when the newly classified structure lies within the ambiguous "twilight zone" 
between two closely related topologies (for CATH) or folds (for SCOP), as demonstrated in 
detail below. 



Automated assignment of CATH classification from FSSP 



In this section we describe the procedure that we used to predict CATH topology level from 
the FSSP scores. We identified a set of 7431 proteins (PFC , see Methods) that appear in 
FSSP but were not yet processed by CATH 1.7. Our goal is to predict the CATH topology of 
these 7431 proteins by using (1) the Z-scores between all proteins in PF (see Methods) and 
(2) the known classifications of the set PFrCs (see Methods). 

Predicting topologies is a classification problem which we treated with pattern recogni- 
tion tools. We tested several prediction algorithms using cross-validation to estimate their 
performance [21|. Every one of the algorithms that were tested can be viewed as a two-stage 
process. In the first stage a new similarity measure is produced from the original Z-scores. 
This is done either by a direct rescaling of the original Z-scores, or by using the results 
of various hierarchical clustering methods to produce new similarity measures. The second 
stage consists of using these similarities as the input to some classification method, yielding 
predictions for the classes and architectures. In this work we present only results obtained 
by one particular method - CO - which uses the original Z score as a similarity measure (see 
Methods) . A complete list of the results obtained by using other methods can be found in ||21| 



which is available on the web site [31|. 

Our final assignments for the set PFC using the CO method are listed in the web site. 
A more illustrative way to present these results is shown in Figure ^. In Figure ^ we present 
the Z score matrix for the combined set PFrCs+PFC . The submatrix in the upper left corner 
is the reordered Z score matrix of the set PFrCs which was already shown in Figure |b. The 
rest of the matrix in Figure ^a presents the Z scores of PFrCs with the set PFC (randomly 
ordered) and the Z scores of PFC among themselves. In Figure |3|b we reordered the rows and 
columns whose index > 860, corresponding to proteins in PFC . Whereas in the matrix of 
Figure |^ these proteins appear in a random order, in ^ they appear in the order imposed by 
our prediction of their CATH topology. One can see that the original order in the submatrix 
PFrCs is propagated by our assignment procedure to the set PFC . For example, focus on the 
small black square at the upper left corner of the matrix. This small black square represents 
the high Z-scores among the mainly-a class of proteins in PFrCs. In the corresponding top 
rows of the full matrix we see high Z-scores between these structures and some proteins from 
PFC . In particular, the small group with indices near 2476, are "close" to these mainly-a 
structures, and hence are also classified as such. On the other hand there is a large group of 
structures from PFC (between 861 and 2476), which do not have high Z-scores with any of 
the proteins in PFrCs or with any of the other structures in PFC with index > 2476. Hence 
we are unable to classify this group of structures on the basis of their FSSP scores. 

Figure ^b illustrates the central idea of this work. We perform a task which is intermediate 
between clustering and classification. We take proteins of known classification and we use 
them as fixed a priori values in a clustering procedure. 

The overall success rate of our prediction estimated by cross-validation was 93%. In order 
to understand the significance of these success rates we derived a statistical (see Methods) 
upper bound for this kind of prediction. This upper bound is 95% (see Methods), hence the 
figure of 93/95=98% given above 

We estimated the accuracy of the prediction by using the following procedure. First the 
set PFCs was randomly "diluted" ; that is, we randomly chose a certain fraction of the proteins 
in PFCs and placed them in a test set, pretending that we do not know their classification. 
The FSSP scores of the entire set were then used to classify the test set. For each protein from 
the test set we either return a predicted classification or reject the protein (i.e. we declare 
that we are unable to classify it). The quality of any classification algorithm (see Methods) 

^ One must keep in mind that the estimated success rate is calculated for all proteins; both FSSP repre- 
sentatives (~ 10% of the proteins) and non-representatives. Since the presence of homologous proteins can 
create a bias in these estimates, we also tested the success rate of predicting the CATH topology only for the 
FSSP representatives which yielded 63%, to be compared with the corresponding upper bound of 74%. 



is measured by its success rate (fraction of correctly classified proteins, out of the test set) 
and by the purity (success rate out of the non-rejected proteins). For the CO method the 
results were 93% for the success rate and 98% for the purity (using a dilution of 20%). More 
extensive tests at other dilutions and for other methods are of classification are discussed 
in and available at the web site pi] ]. 

We also tested directly the reliability of the CO assignments by using the CATH version 
2.0 (PC2). In PC2, 1640 single domain proteins that are present in PFC were assigned to 
one of the topologies that existed in vl.7. 58 of these we "rejected". In 1266 cases out of the 
remaining 1582 (80%) our prediction agrees with the one given in CATH v2.0. Almost all 
the cases in which we misassigned a domain can be explained in a simple way. These cases 
are discussed in detail in a following section. 

The CO method can also be used to predict directly the C level and the A level of CATH. 
We found that when the C and A levels were predicted as a byproduct of predicting the T 
level, the resulting C and A were consistent with those predicted directly. 



Automated assignment of SCOP classification from FSSP 

We used the CO method to predict the SCOP fold for a set of 3451 proteins (PFS) that 
belong PF but not yet in PS. The results are available on the web site |31]. The estimated 
success rate (by cross-validation) was 93%. As in the case of CATH this number increased 
when we discarded proteins in the "twilight zone" (see next section). 



Twilight zone for protein classification 

The attempt to assign a new protein to a known fold might lead to frustration because at times 
one is undecided about two or more possibilities. In order to assess that two proteins have 
similar structures, a similarity score is needed. FSSP uses the Z score, CATH uses the SSAP 
score and SCOP a subjective evaluation, which is also a kind of score. The problem arises 
when the protein to be classified has high scores with two proteins already classified, but to 
different topologies. In this paper these proteins are called "borders" (see Methods). Being or 
not a border protein depends on the similarity score. We showed, however, that FSSP, CATH 
and SCOP are to a large extent consistent classifications. We therefore suggest that there 
are "intrinsically" ambiguous cases ~ cases that are unavoidable in structure comparison. We 
refer to these ambiguous regions in structure space as the "twilight zone" in analogy with 
the case of protein sequence comparison where proteins with sequence similarity below 30% 
can not be reliably assigned to the same fold. We illustrate this concept by a typical case, 
shown in Fig. ^ This is a border protein. Protein Idhn (the central one) is the one to be 
classified (in fact, it is a 3-layer sandwich according to CATH). It has a Z score of 9.3 with 
protein laSrA (on the left) which is a 3-layer sandwich topology and a Z score of 8.7 with 
protein lb66A (on the right) which is a 2-layer sandwich topology. This example illustrates 
how structural information alone might not provide a clear-cut criterion for classification of 
this protein. The incidence of the twilight zone is shown in Fig. |5[ In Fig. ^a we present the 
histogram of the number of protein pairs that have different CATH topologies as a function of 
their Z score. This number is a rapidly decaying function of Z . On the contrary, the number 
of pairs with the same CATH topology is a slowly decaying function of Z. For Z > 3 the 
probability of having the same CATH topology becomes greater than that of having different 
topologies. For Z > 7.5 the probability to have the same topology is 97.5 %. In |5|b we show 
the corresponding figure for SCOP. The number of folds in SCOP is larger than the number 
of topologies in CATH, therefore there is more ambiguity. However, also in this case for 
Z > 7.5 the probability to have the same topology is 93.5 %. Taken together these results 
indicate that the twilight zone for structure comparison can be bound hy Z < 7. 

There are other cases in which the classification of a particular protein is inconsistent 
with that of all its neighbors. For example, proteins that we called "colonies" (see Methods) 



are such that none of their neighbors are of their own kind. This means that the FSSP scores 
imply that these proteins are similar only to proteins of different classes and architectures. 
Identifying these proteins can also focus the attention to possible misclassification or to 
drawbacks of the Z-score. For example, one of the 49 colonies (at the architecture level) that 
we found in CATH is the PDB entry IrboC, which is classified as a a-/3 two-layer sandwich. 
It has 15 neighbors in PC, 14 of which are classified as mainly-/? sandwiches. 

We summarize the results about the assignments of the CATH architecture for proteins 
that already have a CATH classification (PFCs) in a "confusion table" (see Table |2|). The first 
column lists the "correct" classification (as given in CATH vl.7 for the test set); the second 
column gives the assignments by CO (correct, incorrect or reject) and the third column lists 
the corresponding percentages. A full list of the inconsistent proteins is available on the web 
site g. 

Another problem is that there are some large Z-scores between proteins of different ar- 
chitectures. Such large Z-scores arise when a protein of one particular architecture has a 
similar structure to a part of a protein of a different architecture. Swindells et al. call the 
phenomenon of structures within structures, the "Russian doll" effect |32]. Such cases are 
common between architectures of long proteins that contain sub-structures corresponding 
to architectures of shorter proteins, e.g. there are many two-layer sandwich proteins that 
resemble a part of three-layer sandwich proteins. Such relationships can occur at the class 
level, e.g. a-/3 proteins that contain mainly-a or mainly-/? proteins (IrboC, IhgeA). They 
can also occur at the architecture level within the same class, e.g. a-(5 complex architecture 
contains a-/3 two-layer sandwich (IregX). Other inconsistencies occur when proteins fit two 
architecture definitions. 



Class prediction using the web site 

In order to retrieve our prediction for the CATH topology or SCOP fold of a protein, one can 
use the web site by entering the protein chain identifier in the search box and submitting 
the query. If the protein appears in our database then a table will be returned containing both 
the known and the predicted SCOP and CATH classifications. For example, the submission 
of the chain identifier "IcuoA" returns Table |^. This protein was classified by neither CATH 
vl.7 nor SCOP 1.53 which are the basis of our predictions. We predicted it to belong to 
CATH topology 2.60.40 and SCOP fold 2.5. Later, the release CATH v2.0 identified IcuoA 
as 2.60.40. 



CONCLUSIONS 

The rapidly increasing number of experimentally derived protein structures requires a contin- 
uous updating of the existing structure classification databases. Each group adopts different 
classification criteria at the level of sequence, of structure and of function similarities. A com- 
parison between different classification schemes can help understanding the optimal interplay 
between different levels, it can reveal possible misclassification, and it can ultimately offer a 
fully automated updating procedure. Manual steps can be automated in an ever-increasing 
way by using the tools made available by other databases. 

In this work we showed that it is possible to automatically predict the CATH topology 
and the SCOP fold from the FSSP Z-scores. It is possible to submit a protein of unknown 
CATH or SCOP classifications but known FSSP Z-scores to the web site to obtain its 
CATH and SCOP classifications. Since the FSSP database is updated weekly, our procedure 
offers the possibility to update also CATH and SCOP with the same frequency (at least 
down to the topology and fold level, respectively). We introduced a classification method 
that clusters together structures of known and unknown classification according to their Z- 
scores. When proteins outside the twilight zone for structure comparison are considered our 



method is highly rehable. We suggest that, in order to classify proteins within the twilight 
zone, other classification criteria, based on sequence and function similarity, must be adopted. 

The advent of genome projects is multiplying the efforts in the field of protein classifica- 
tion. In the past the aim was to help finding the structure of the particular protein which was 
interesting at a given time. Now the hope is to find a large representative set of structures 
which can encompass most of the existing folds, possibly all of them In such a large 
scale project human intervention, which is precious in setting the principles of classification, 
should be gradually replaced by automated procedures. 
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MATERIALS AND METHODS 

Databases and protein sets 

As the CATH and SCOP databases classify domains, whereas FSSP deals with chains, we 
considered only chains which form a single domain and, therefore, these proteins appear as 
a single entry in the three databases. Several groups have developed methods to identify 
protein domains |20, 23,^-35 1. In this work, we used the Dali Domain Dictionary to 



identify single domain proteins. 

We used the following databases. The CATH release 1.7 which contains 15802 protein 
chains, among which 10906 are classified as single domain. This latter set is called PCs. We 
also used the CATH release 2.0 which contains 20780 protein chains, among which 14389 are 
single domain (PC2s). The SCOP release 1.53 which contains 20021 protein chains, among 
which 15375 are single domain (PSs). The FSSP release from Jan 14, 2001 which contains 
22660 protein chains (PF). The FSSP proteins are grouped into 2494 homology classes such 
that within a class the sequence similarity is above 25%. One protein per class is selected as 
representative and we call PFr the set of all representatives. All the protein sets and their 
sizes are listed in Table 0. 



The Classification by Optimization (CO) method 

The classification scheme that we used is based on the minimization of a particular cost 
function, defined as follows (for the case of the prediction of CATH topology; a similar 
definition holds for SCOP folds); Each protein is assigned an integer number Cj, describing 
its topology (1 to 305). We assign to proteins with known classification the value of c(z) 
determined by their CATH classification. To the yet unclassified proteins we assign initially 
random values from 1 to 305. A cost is calculated for each configuration C = {cj} of topologies 
which penalizes the assignment of different topologies to any pair of proteins. The value of 
this penalty is chosen to be the similarity measure Zij between proteins i and j; the higher 
the similarity Zij, the more costly it is to place proteins i and j in different topologies. The 



cost function is defined as the sum of penalties for all protein pairs 

E{C) = Y.Z,,[l-5[c,,Cj)] . (1) 

The classification problem is stated as finding the minimal cost configuration of the unclassi- 
fied proteins, while keeping the topologies (i.e. the Cj values) of the classified proteins fixed. 
This problem corresponds to finding the ground state of a random field Potts ferromagnet. 

We search for a classification C of minimal cost by an iterative greedy algorithm described 
in detail elsewhere The algorithm identifies at which iteration, if any, it performed a 
heuristic decision. For low fractions of unknown topologies the algorithm usually reaches the 
global minimum of the cost function. 

Bounds on the success rate of the prediction 

In this section we establish a statistical upper bound for the prediction success rate relevant 
to a family of prediction algorithms. 

The Z-matrix can be reinterpreted as a weighted graph; each vertex in the graph represents 
a protein and the weights on the edges connecting two vertices are the corresponding Z-scores. 
Edges with Z < 2.0 are absent from the graph. Following this representation, we define 
two proteins as neighbors if they are connected by an edge. By analyzing the connectivity 
properties of set PC we make inferences about our predictive power. 

One can characterize the FSSP-based neighborhood of a protein according to the CATH 
classification of itself and its neighbors. Every protein must belong to one of four categories: 

(A) "Island" - The protein has no neighbors; 

(B) "Colony" - It has no neighbors of its own kind; 

(C) "Border" - It has neighbors of its own kind as well as of other kinds; 

(D) "Interior" - The protein has only neighbors of its own kind. 

Using these definitions we can arrange the proteins of PC in groups according to their neigh- 
borhood category at the class, architecture and topology levels. The distribution of the 
proteins among these groups can be used to calculate an upper bound for the CO method, 
if we assume that the set of unclassified proteins has the same distribution as the classified 
ones. For example, islands cannot be classified and are therefore rejected. Colonies are bound 
to be misclassified since none of their neighbors give a clue on their type. Since the fraction 
of proteins in each category was estimated on the basis of a sample, it can be interpreted 
only as a statistical upper bound. 

We consider the set PFCs to obtain a first type of upper bound for the success rate of the 
CO method. This set (see Table |3|) is formed by 10541 proteins, among which 5% are islands, 
a negligible fraction (0.2%) are colonies, 6% are borders and 88% are interiors. Therefore the 
upper bound that we found is about 95% for predicting the topology level in CATH. 

The actual prediction performed in this work is done on the set PFC which is formed by 
the 7431 proteins that are in FSSP (14 Jan 2001) but not in CATH1.7 (see Table |). Within 
PFC there is a subset of 1617 (about 22%) proteins which are either islands or super-islands, 
i. e. they are connected only with other proteins in the subset and therefore they have no 
connection to proteins with known classification. Thus, the upper bound for this second type 
of prediction is about 78%. 

Evaluating a classification prediction algorithm 

Since an algorithm can output either a predicted classification or a "rejection", if it does 
not have any prediction, one has to estimate two probabilities: Psuccess and Preject- Robust 
estimation of these parameters is produced by cross-validation a procedure which consists in 
averaging over many (T) randomly sampled test trials. In each trial, the set is divided into 
two subsets; one is used for training the algorithm and the other set, of Ntest proteins, is used 



to test the algorithm by comparing its prediction to the true classification. The probability 
estimates are given by 

p — 1 /T success 

^success — -'-/-i / , AT- V y 

^^test 

T 

p _ 1 _ p _ 1 /T^ ^tesf — Nreject (n\ 

^non— reject — -'^reject — ^1 / , at V"/ 

^^test 

(4) 

Another figure of merit, the purity Ppure, is the probability of correctly classifying non-rejected 
proteins. It is estimated by 

P 

A success /[-\ 

J- ^reject 
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Abbreviation 


Definition 


3Dee 


Database of Protein Domain Definitions 


ASTRAL 


The ASTRAL Compendium for Sequence and Structure Analysis 


CATH 


Protein Structure Classification 


CO 


Classification by Optimization 


DALI 


Protein Structure Comparison by Alignment of Distance Matrices 


DHS 


Dictionary of Homologous Superfamilies 


FSSP 


Fold classification based on Structure-Structure alignment of Proteins 


HOMSTRAD 


HOMologous STRucturc Alignment Database 


MMDB 


Molecular Modeling DataBase 


PDB 


Protein Data Bank 


SCOP 


Structural Classification of Proteins 


SSAP 


Structure Comparison Algorithm 



Table 1: Abbreviations and definitions. 



Original Classification 


Assigned classification 


percent of cases 


Mainly Alpha 






1.10 Orthogonal Bundle 


1.10 Orthogonal Bundle 


96.3% 




Reject 


3.3% 


1.20 Up-down Bundle 


1.20 Up-down Bundle 


97.7% 




4.10 Irregular 


1.2% 




1.10 Orthogonal Bundle 


0.7% 


1.25 Horseshoe 


1.25 Horseshoe 


100.0% 


1.50 Alpha/alpha barrel 


1.50 Alpha/alpha barrel 


100.0% 








2.10 Ribbon 


2.10 Ribbon 


93.9% 




Rej ect 


5.7% 


2.20 Single Sheet 


2.20 Single Sheet 


97.27o 




Reject 


2.3% 


2.30 Roll 


2.:«) Roll 


97.2%: 




Reject 


2.1',-; 




'■',.!{) Roll 


0.7'/; 


2.40 Barrel 


2.40 Barrel 


91.0% 




Reject 


8.8% 


2.50 Clam 


2.50 Clam 


94.4% 




2.40 Barrel 


6.6% 


2.60 Sandwich 


2.60 Sandwich 


86.1% 




Reject 


13.9% 


2.70 Distorted Sandwich 


2.70 Distorted Sandwich 


96.1% 




2.60 Sandwich 


3.9% 


2.80 Trefoil 


12. SO Tref..il 


1110.0';.; 


2.90 Orlhogonal Prism 


2.9(J Oilhogonal Prism 


luu.ij'/;. 


2.100 Aligned Prism 


2.100 Aligned Prism 


100.0% 


2.102 3-layer Sandwich 


2.102 3-layer Sandwich 


78.6% 




2.30 Roll 


21.4% 


2.110 4 Propellor 


2.110 4 Propellor 


100.0% 


2.120 6 Propellor 


2.120 6 Propellor 


96.1% 




Reject 


3.9% 


2.130 7 Propellor 


2.130 7 Propellor 


100.0% 


2.140 8 Propellor 


2.140 8 Propellor 


85.3% 




Reject 


14.7% 


2.160 3 Solenoid 


2.160 3 Solenoid 


100.0% 


2.170 Complex 


2.170 Complex 


83.3%o 




2.60 Sandwich 


8.6% 




Reject 


8.0% 


Mixed Alpha-Beta 






3.10 Roll 


3.10 Roll 


99.9% 


3.20 Barrel 


3.20 Barrel 


100.0% 


3.30 2-Layer Sandwich 


3.30 2-Layer Sandwich 


93.5% 




Reject 


6.0% 


3.40 3-Layer(aba) Sandwich 


3.40 3-Layer(aba) Sandwich 


96.1% 




Reject 


3.8% 


3.50 3-Layer(bba) Sandwich 


3.50 3-Layer{bba) Sandwich 


72.1% 




Reject 


27.2% 




3.30 2-Layer Sandwich 


0.7% 


3.60 4-Layer Sandwich 


3.60 4-Layer Sandwich 


99.7% 


3.70 Box 


3.70 Box 


100.0%; 


3.7.-) .".-^iraucled rrc.peller 


li.rr, --^iriuided rr(.|)eller 


11)1). 1)';.; 


3.80 ILirseslicc' 


3.60 Horseshoe 


WilAI'/i 


3.90 Complex 


3.90 Complex 


97.9% 




Reject 


0.7% 


Few Secondary Structures 






4.10 Irregular 


4.10 Irregular 


90.8% 




Reject 


8.3% 




1.20 Up-down Bundle 


0.8% 



Table 2: Summary of a "confusion table". This table summarizes the results about the 
assignments of the CATH architecture for proteins that have already a CATH classification. 
Only cases which occur more than 0.5% are listed. These figures were calculated using 100 
cross-validation runs at 20% dilution. 



Name 



Description 



Size of set 



PF all chains in FSSP (14 Jan, 2001) 

PFr representative chains in FSSP (14 Jan, 2001) 

PC chains in CATH vl.7 

PCs single domain chains in CATH vl.7 

PC2 chains in CATH v2.0 

PC2s single domain chains in CATH v2.0 

PS chains in SCOP 1.53 

PSs single domain chains in SCOP 1.53 

PCsSs single domains chains in SCOP 1.53 and CATH vl.7 

PFrCs single domain chains in CATH that are representatives FSSP (PFr fi PCs) 

PPrSs single domain chains in SCOP that are representatives FSSP (PFr fi PSs) 

PFCs chains in FSSP and single domain in CATH vl.7 

PFSs chains in FSSP and single domain in SCOP 1.53 

PFC chains in FSSP and not in CATH vl.7 

PFS chains in FSSP and not in SCOP 1.53 



22660 

2494 

15802 

10906 

20780 

14389 

20021 

15375 

10197 

860 

1626 

10541 

14716 

7431 

3451 



Table 3: Protein sets and their sizes. 



Chain id 


CATH vl.7 


CATH v2.0 


CATH Prediction 


SCOP 1.53 


SCOP Prediction 


# 


C 


A 


T 


# 


C 


A 


T 


C 


A 


T 


# 


C 


F 


C 


F 


Icuoa 


-1 








1 


2 


60 


40 


2 


60 


40 


-1 






2 


5 



Table 4: The search result when submitting "IcuoA" to the web site 
|http: / / www.weizmann.ac.il / physics / complex/ compphys /f2cs / . This protein was classi- 



fied by neither CATH vl.7 nor SCOP 1.53 which are the basis of our predictions. We 
predicted it to belong to CATH topology 2.60.40 and SCOP fold 2.5. Later it was indeed 
classified by CATH v2.0 as 2.60.40. The -1 in both CATH vl.7 and SCOP 1.53 represents 
that it was not classified by them. 




Figure 1: (a) Z-score matrix between all pairs of proteins in the PPrCs set. A black dot 
represents Z > 2.0. (b) Same Z-score matrix with rows and columns rearranged by using 
the CATH classification (see text). Part (b) shows the underlying order behind the apparent 
randomness of part (a) and illustrates the extent to which the FSSP Z-scores reflect the 
CATH classification. The regions A, B, C and D are discussed in the text. 



Single domain cliains in SCOP fold and CATH topology 
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Figure 2: Translation table from the CATH topology to the SCOP fold and vice versa. Non- 
zero entries of Tij appear as black dots. Tij is proportional to the number of proteins of 
CATH topology i that have a SCOP fold j in PCsSs. 
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(b) 
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Figure 3: (a) Z-score matrix between all pairs of proteins in the combined PFrCs+PFC sets. 
The submatrix in the upper left corner is the reordered Z score matrix of the set PFrCs which 
was already shown in Figure |l]b. The rest of the matrix presents the Z scores for the proteins 
in the set PFC . (b) The same matrix as in (a) with the rows and columns relative to the 
proteins in PFC reordered according to our assignment of their CATH topology. With the 
CO method the original order in the submatrix PFrCs is propagated to the entire matrix. 



Figure 4: (centre) Protein Idhn which has a CATH a (3 3-layer {f3f3a) sandwich Aspartylglu- 
cosaminidase chain B (3.50.11) topology, (left) Protein laSrA which has also a CATH a [5 
3-layer {I3(5a) sandwich Aspartylglucosaminidase chain B (3.50.11) topology and has Z score 
of 9.3 with protein Idhn. (right) Protein lb66A which has a CATH a [3 2-layer sandwich 
Tetrahydropterin Synthase, subunit A (3.30.479) topology and has Z score of 8.7 with pro- 
tein Idhn. This example illustrates how structural information alone might be insufficient to 
provide a clear-cut criterion for the classification of this protein. 



Z>7.5 prob(same topology)=97.5% 



I I Same CATH Topology 
I I Different CATH Topology 




Z>7.5 prob(same fold)=93.5% 



I I Same SCOP Fold 
H Different SCOP Fold 
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Figure 5: Twilight zone for protein structure classification, (a) The number of protein pairs 
of with a given FSSP Z score that have different CATH folds is a rapidly decaying function 
of Z . On the contrary, the number of proteins pairs with the same CATH fold is decaying 
slowly. For Z < 5 there is a non-negligible probability to have different folds. We call this 
threshold the "twilight zone for structure classification", (b) The corresponding histogram 
for SCOP folds. The number of SCOP folds is larger than the number of CATH topologies, 
hence the twilight zone is Z ^ 7. 



